UPSTREAM PR #16636: vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp #78

DajanaV · 2025-11-04T13:42:31Z

This PR adds for cache_a and cache_b to load an additional vec2, and increases BK=32 for non-CM mul_mm.comp

Performance Comparison (Without coopmat and coopmat2) NVIDIA GeForce RTX 4060 Ti

Kernel	Before(us/run)	After(us/run)	Δ %
`MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5767.79	5176.01	+10.26%
`MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5355.88	4105.95	+23.34%
`MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5219.90	5432.22	-4.07%
`MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	2722.40	2732.62	-0.38%
`MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	2743.99	2753.02	-0.33%
`MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	2843.99	2850.78	-0.24%
`MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	2840.88	2841.73	-0.03%
`MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	2853.15	2857.24	-0.14%
`MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4327.78	4334.87	-0.16%
`MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4306.28	4289.52	+0.39%
`MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4751.79	4781.23	-0.62%
`MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4748.76	4785.89	-0.78%
`MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5155.43	5164.14	-0.17%
`MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4900.78	4914.74	-0.28%
`MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4318.07	4371.76	-1.24%
`MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4643.73	4815.24	-3.69%
`MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5250.76	5015.61	+4.48%
`MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4348.33	4388.21	-0.92%
`MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4821.34	4570.77	+5.20%
`MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5646.37	5633.01	+0.24%
`MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4229.37	4240.83	-0.27%
`MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4339.20	4358.97	-0.46%
`MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	4724.33	4779.14	-1.16%

Performance Comparison (Without coopmat and coopmat2) AMD Radeon RX 7800 XT

Kernel	Before(us/run)	After(us/run)	Δ %
`MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	8873.61	5853.29	+34.04%
`MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	6458.76	5747.87	+11.01%
`MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	7124.22	7401.83	-3.90%
`MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	3289.51	3318.63	-0.89%
`MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	3499.61	3527.61	-0.80%
`MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	3424.27	3446.08	-0.64%
`MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	3707.70	3732.88	-0.68%
`MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	3747.02	3767.69	-0.55%
`MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	6160.74	6393.07	-3.77%
`MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5936.61	6047.77	-1.87%
`MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	7717.80	7037.06	+8.82%
`MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	8219.73	8849.61	-7.66%
`MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	7289.05	7447.10	-2.17%
`MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	7668.33	6923.90	+9.71%
`MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5797.82	5618.78	+3.09%
`MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5764.74	5403.05	+6.27%
`MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5695.78	5998.68	-5.32%
`MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	6074.55	5980.28	+1.55%
`MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5571.36	5367.69	+3.66%
`MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5704.28	5651.10	+0.93%
`MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	6416.39	5307.34	+17.28%
`MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	5968.62	5845.84	+2.06%
`MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)`	8289.75	7982.64	+3.70%

Signed-off-by: Stefan Savic <[email protected]>

loci-agentic-ai · 2025-11-04T14:21:54Z

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Based on the comprehensive analysis of llama.cpp versions 478f0f77-4a6a-4412-93cb-65be885f7ec3 vs a98c0b17-e20d-4b11-8978-6d6d10c53020, the changes represent Condition 1: No meaningful performance impact.

Overview

The analysis reveals minimal performance variations with no functional code modifications in core inference components. The highest measured change was a 0.17% throughput increase in a C++ standard library constructor, while critical LLM functions remain unchanged.

Key Findings

Performance Metrics:

Highest Response Time Change: std::vector<llm_bigram_spm>::pop_back() improved by 0.10% (0.067 ns reduction from 67 ns to 67 ns)
Highest Throughput Change: std::_Optional_base constructor degraded by 0.17% (0.040 ns increase from 24 ns to 24 ns)

Core Function Impact:
No changes detected in critical inference functions:

llama_decode() - unchanged
llama_encode() - unchanged
llama_tokenize() - unchanged
llama_model_load_from_file() - unchanged

Tokens Per Second Impact:
Zero impact on inference performance. The measured changes occur in auxiliary STL functions unrelated to the tokenization/inference pipeline. Core functions responsible for token processing remain identical between versions.

Power Consumption Analysis:
Negligible power consumption changes across all 15 binaries:

libllama.so: 280,665 nJ (effectively unchanged)
llama-cvector-generator: 314,116 nJ (effectively unchanged)
All GGML libraries maintain identical power profiles

Flame Graph & CFG Analysis:

Identical assembly code: Byte-for-byte identical instructions in analyzed functions
No structural changes: Control flow graphs show identical branching patterns
Performance variance: The 0.06 ns improvement reflects measurement noise rather than code optimization

GitHub Code Review (PR #78):
The actual code changes target Vulkan GPU compute shaders for matrix multiplication optimization, completely separate from the CPU-based functions showing performance variations. PR #78 introduces:

Enhanced F32/F16 matrix operations with up to 34% GPU performance improvements
No impact on CPU inference pipeline or tokenization components

Conclusion:
The measured performance differences represent normal measurement variance rather than functional improvements. Core LLM inference capabilities remain unchanged, with zero impact on production workloads.

Using only for f16 and f32

40c7031

Signed-off-by: Stefan Savic <[email protected]>

DajanaV temporarily deployed to PROD__AL_DEMO November 4, 2025 13:42 — with GitHub Actions Inactive

DajanaV force-pushed the main branch 27 times, most recently from 44faeaa to d7421a0 Compare November 8, 2025 09:08

loci-dev force-pushed the main branch 30 times, most recently from 9d00b69 to c481809 Compare December 10, 2025 10:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UPSTREAM PR #16636: vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp #78

UPSTREAM PR #16636: vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp #78

Uh oh!

DajanaV commented Nov 4, 2025

Uh oh!

loci-agentic-ai bot commented Nov 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

UPSTREAM PR #16636: vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp #78

Are you sure you want to change the base?

UPSTREAM PR #16636: vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp #78

Uh oh!

Conversation

DajanaV commented Nov 4, 2025

Uh oh!

loci-agentic-ai bot commented Nov 4, 2025

Performance Analysis Summary

Overview

Key Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants